One possible solution

The question is: can anything be learned from playing random games? The answer is yes, for the following reason. Some moves are good no matter when they are played, provided a player actually gets to play them. For example, suppose a certain move by Black makes a group of Black stones live, while Black's group dies if White prevents that move. Then the average result of the games in which Black got to play the move differs from the rest by the corresponding value.

A general rule taught to novices in go is to ``always look for the biggest move on the board''. One separates the situation on the board into local exchanges of moves and assigns to the move initiating play in a region its estimated benefit, expressed explicitly as a number of points gained or lost. The suggestion is then to play the most valuable move first. Of course, this can only be a rule of thumb, because the moves are often not completely independent of each other. Furthermore, it matters which player will have the initiative after a sequence of moves (sente); i.e., a smaller move which does not lose sente can be played first. For our purpose it is important that go nevertheless allows such localization to a high degree, much more so than is the case, for example, in chess.

Under these circumstances, playing random games to the end is not a wasted effort, since good moves may be detected. Identifying good moves wherever they occur in a game tree is also a standard tool of pruning. The α-β algorithm examines the minimal number of positions if at each branching the best move is tried first. The so-called killer heuristic suggests trying first those moves which are currently the best moves anywhere in the tree, and significant improvements can be achieved this way [8].
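As a point of reference, a minimal sketch of such killer-style move ordering might look as follows; the function name and list representation are illustrative, not taken from [8]:

```python
def order_moves(legal_moves, killer_moves):
    """Killer-heuristic sketch: moves that recently proved best elsewhere
    in the tree are tried first; all other moves keep their old order."""
    killers = [m for m in killer_moves if m in legal_moves]
    rest = [m for m in legal_moves if m not in killers]
    return killers + rest
```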

For these reasons we adopt the following strategy for playing random games. Each player decides beforehand in which order he wants to play his moves (taking into account that stones may be played several times onto the same point after captures). A game is played to the end such that, if a move in the predetermined list is not possible, the next one is used. Each move is assigned the average value of all the games in which it was played. This value is updated after every game, and after every update the moves are reordered according to their average values. The strategy is therefore that moves which are most successful on average are deemed most important and are to be played first. After a large number of games, both Black and White will have settled on sequences of moves that are most successful against each other. The best move for the player who moves first is the first one in his list.
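As an illustration, here is a minimal Python sketch of this bookkeeping. The helper `simulate_game`, the move encoding, and the use of a single shared list (the strategy above keeps one ordering per player) are simplifying assumptions, not part of the original proposal:

```python
import random

def learn_move_order(all_moves, num_games, simulate_game):
    """Zeroth-order sketch: keep moves ordered by their average game result.

    simulate_game(order) is assumed to play one random game to the end,
    taking moves from `order` and skipping illegal ones, and to return
    (moves_played, score) for the finished game."""
    totals = {m: 0.0 for m in all_moves}   # summed results per move
    counts = {m: 0 for m in all_moves}     # games in which the move occurred
    order = list(all_moves)
    random.shuffle(order)                  # arbitrary initial preference

    for _ in range(num_games):
        moves_played, score = simulate_game(order)
        for m in moves_played:
            totals[m] += score
            counts[m] += 1
        # Moves that do best on average are tried first in the next game.
        order.sort(key=lambda m: totals[m] / counts[m] if counts[m] else 0.0,
                   reverse=True)
    return order   # the head of the list is the candidate best first move
```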

As described so far, we are considering a greedy algorithm which will tend to remain at local extrema. The modification for simulated annealing is to introduce a finite probability for a move to be played out of order which depends on the exponential of the value difference divided by a parameter playing the role of temperature. As before, when the temperature becomes zero, we have frozen all deviations from the greedy strategy.
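One concrete reading of this modification is Boltzmann sampling over the move list: each move is chosen with a weight given by the exponential of its value difference to the best move, divided by the temperature. The following sketch makes that assumption explicit; it is one possible realization, not the only one:

```python
import math
import random

def choose_move(ordered_moves, avg_value, temperature):
    """Annealing sketch: sample a move with weight exp(dv / T), where dv is
    the (non-positive) value difference to the best move.  As T -> 0 the
    weights of all but the top move vanish and the greedy order is frozen."""
    if temperature <= 0:
        return ordered_moves[0]            # frozen: strictly greedy
    best = max(avg_value[m] for m in ordered_moves)
    weights = [math.exp((avg_value[m] - best) / temperature)
               for m in ordered_moves]
    r = random.uniform(0.0, sum(weights))
    for move, w in zip(ordered_moves, weights):
        r -= w
        if r <= 0.0:
            return move                    # out-of-order move accepted
    return ordered_moves[-1]               # guard against rounding
```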

The algorithm given in the introduction provides the framework for several variations of this idea. The above construction could be called the zeroth-order approach in the following sense. Obviously, some moves are only good as answers to particular other moves. To incorporate such interdependencies one can store not just the average value of each move but its average value after a particular other move was played first. This could be called the first-order approach, and there clearly is a generalization to n-th order. In the limit of large n the complete game information has been stored and the method becomes equivalent to an exhaustive tree search. This means that in the limit the method works perfectly (if inefficiently).
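A sketch of the first-order bookkeeping, with illustrative names (the conditional key structure is the point of the example, not the exact API):

```python
from collections import defaultdict

# First-order statistics: a move's value is recorded conditioned on the
# move that was played immediately before it.
totals = defaultdict(float)   # (previous_move, move) -> summed results
counts = defaultdict(int)     # (previous_move, move) -> number of games

def record(previous_move, move, result):
    totals[(previous_move, move)] += result
    counts[(previous_move, move)] += 1

def value(previous_move, move):
    n = counts[(previous_move, move)]
    return totals[(previous_move, move)] / n if n else 0.0
```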

The principal limitation is that the more information we want to extract from the games that are played, the more games have to be played to make this information meaningful. A general fact about the statistics of the data is that the error in the average value Δv is typically proportional to one over the square root of the number of games n,

$$\Delta v \propto \frac{1}{\sqrt{n}}. \qquad (2)$$
(Other numerical methods usually do much better.) This implies that to determine the value of a move to within 1 point, when the value in the random games fluctuates by 100, we have to play on the order of 10000 games! For the actual tests presented below, a few hundred games turned out to be feasible. Therefore, higher-order considerations have to be postponed.
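The required number of games follows directly from eq. (2); a one-line check (the function name is illustrative):

```python
def games_needed(fluctuation, target_error):
    """With error ~ fluctuation / sqrt(n), a target error requires
    n ~ (fluctuation / target_error)**2 games."""
    return (fluctuation / target_error) ** 2

print(games_needed(100, 1))   # 10000.0 -- the example from the text
```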

What our algorithm is expected to accomplish is to play perfect games against dumb opponents. The games can be made perfect by collecting enough statistics (cooling slowly enough). How dumb the opponent is depends on how much information can be extracted from the games. Playing a large number of games against a strong, perhaps human opponent would let the algorithm play a better game against any opponent. The problem is, of course, that the only opponent available is the algorithm itself.

This concludes our motivation for the algorithm given in the introduction. The crucial point of our proposal for applying simulated annealing to tree searches in go is to replace the strict move order (familiar from combinatorial optimization) by the concept of timeliness, or priority, of each possible move. Hopefully, this representation of a tree search leads to changes of configuration that simulated annealing can manage. The other novel aspect (as far as computer go is concerned) is the attempt to exploit the strengths of simulated annealing by playing complete games as a substitute for go heuristics. Enough of theoretical musings; does it work in practice?